feat: drain and volume detachment status conditions #1876

jmdeal · 2024-12-11T20:26:11Z

Fixes #N/A

Description
Adds status conditions for node drain and volume detachment to improve observability for the individual termination stages. This is a scoped down version of #1837, which takes these changes along with splitting each termination stage into a separate controller. I will continue to work on that refactor, but I'm decoupling to work on higher priority work.

Status Conditions:

Condition	Unknown	False	True
`Drained`	Karpenter hasn't attempted to drain the Node yet or the node is currently draining (marked with reason)	N/A	Karpenter has successfully drained the node and will proceed with the termination flow.
`VolumesDetached`	Karpenter hasn't checked for volume attachments yet or there are volume attachments which are currently blocking termination. This won't transition out of unknown until `Drained` transitions to true.	There are blocking volume attachments, but the nodeclaim's TerminationGracePeriod has elapsed. This state is only possible if TGP is configured.	All blocking volume attachment objects have been deleted.

How was this change tested?
make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

coveralls · 2024-12-11T20:47:28Z

Pull Request Test Coverage Report for Build 13248558350

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

82 of 136 (60.29%) changed or added relevant lines in 4 files are covered.
14 unchanged lines in 3 files lost coverage.
Overall coverage decreased (-0.3%) to 81.029%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/utils/node/node.go	0	5	0.0%
pkg/controllers/node/termination/controller.go	71	120	59.17%

Files with Coverage Reduction	New Missed Lines	%
pkg/controllers/node/termination/controller.go	3	64.06%
pkg/controllers/disruption/consolidation.go	4	88.55%
pkg/controllers/provisioning/scheduling/preferences.go	7	86.52%

Totals
Change from base Build 13211152753:	-0.3%
Covered Lines:	9179
Relevant Lines:	11328

💛 - Coveralls

engedaam · 2024-12-12T17:18:44Z

/assign @engedaam

pkg/controllers/node/termination/controller.go

github-actions · 2025-01-02T12:02:03Z

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

jmdeal · 2025-01-11T01:45:31Z

/remove-lifecycle stale

pkg/controllers/node/termination/controller.go

jmdeal · 2025-01-15T20:03:13Z

/hold

engedaam · 2025-01-15T20:05:58Z

/lgtm

engedaam

/lgtm

k8s-ci-robot · 2025-01-15T21:33:48Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam, jmdeal
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

engedaam

/lgtm

pkg/controllers/node/termination/controller.go

jonathan-innis · 2025-01-24T00:12:51Z

pkg/controllers/node/termination/controller.go

-	})
+	if !nodeClaim.StatusConditions().Get(v1.ConditionTypeDrained).IsTrue() {
+		stored := nodeClaim.DeepCopy()
+		_ = nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeDrained)


Should we use the modified check here as well? Would probably work just as well as the IsTrue check

I didn't because we knew it always would be and this limited the scope of stored, I can go either way though. The alternative would be:

stored := nodeClaim.DeepCopy() if nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeDrained) { // remaining logic }

I don't really think it tangibly matters -- good either way, just thought it was odd that we didn't keep consistent style here

Maybe throw a comment over this to explain why it's coded this way? I found myself guessing about it again as I was reading through it -- had to re-read the context to understand that this is because of the patch

pkg/controllers/node/termination/controller.go

jonathan-innis · 2025-01-24T00:19:28Z

pkg/controllers/node/termination/controller.go

-			// 404 = the nodeClaim no longer exists
-			if errors.IsNotFound(err) {
-				continue
+	if volumesDetached {


Is there a way to cleanup some of this logic so that it's not so nested -- comments might also help too -- the fact that we fall-through when volumes are detached and are able to get to the bottom of the function (we don't requeue) is a bit confusing IMO

I've messed around with a few different ways of structuring this, but this ends up being the most readable IMO. I will add some comments to call out the fall-through behavior. FWIW, this does clean up in my larger refactor PR.

pkg/controllers/node/termination/terminator/events/events.go

k8s-ci-robot · 2025-01-27T05:33:28Z

New changes are detected. LGTM label has been removed.

jmdeal · 2025-02-10T20:00:28Z

/hold cancel

jonathan-innis

Close to me! Just a few small things! Nice work!

jonathan-innis · 2025-02-11T23:16:13Z

pkg/utils/node/node.go

@@ -64,6 +64,7 @@ func IgnoreNodeClaimNotFoundError(err error) error {
 // DuplicateNodeClaimError is an error returned when multiple v1.NodeClaims are found matching the passed providerID
 type DuplicateNodeClaimError struct {
 	ProviderID string
+	NodeClaims []string


Suggested change

NodeClaims []string

NodeClaimNames []string

jonathan-innis · 2025-02-11T23:16:32Z

pkg/controllers/node/termination/terminator/suite_test.go

@@ -60,6 +60,7 @@ func TestAPIs(t *testing.T) {
 }

 var _ = BeforeSuite(func() {
+	fakeClock = clock.NewFakeClock(time.Now())


Why did you add the fakeClock here -- doesn't look like it is being used anywhere in this testing?

It's passed down to the terminator and was causing nil panics. We could probably scope it down and just pass it there, but I don't mind keeping it test global so it's obvious to use the fake clock passed to the terminator rather than a new one in any future tests that use it.

jonathan-innis · 2025-02-11T23:27:34Z

pkg/controllers/node/termination/controller.go

-	})
+	if !nodeClaim.StatusConditions().Get(v1.ConditionTypeDrained).IsTrue() {
+		stored := nodeClaim.DeepCopy()
+		_ = nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeDrained)


Maybe throw a comment over this to explain why it's coded this way? I found myself guessing about it again as I was reading through it -- had to re-read the context to understand that this is because of the patch

jonathan-innis · 2025-02-11T23:29:07Z

pkg/controllers/node/termination/controller.go

-			// 404 = the nodeClaim no longer exists
-			if errors.IsNotFound(err) {
-				continue
+	if len(pendingVolumeAttachments) == 0 {


Given how large this volume attachment code is now, what do you think about moving this into a separate function?

I've been resistant to that because I find the requeue logic to be significantly harder to follow as soon as you move it into a separate function - I had a similar issue when I tried to structure this as subreconcilers. I had been punting those architectural changes for my larger refactor PR. If you don't feel strongly about it I'd rather punt for that, since I think the longer reconcile function is a better than obfuscating the control flow, but let me know what you think.

jonathan-innis · 2025-02-11T23:31:25Z

pkg/controllers/node/termination/controller.go

+		if errors.IsConflict(err) {
+			return reconcile.Result{Requeue: true}, nil
+		}
+		return reconcile.Result{}, fmt.Errorf("ensuring instance termination, %w", err)


We don't seem to handle the NotFound error that we get back from this EnsureTerminated call in the same way as we do elsewhere here

jonathan-innis · 2025-02-11T23:33:34Z

pkg/controllers/node/termination/suite_test.go

@@ -908,3 +957,13 @@ func ExpectNodeWithNodeClaimDraining(c client.Client, nodeName string) *corev1.N
 	Expect(node.DeletionTimestamp).ToNot(BeNil())
 	return node
 }
+
+func ExpectNotRequeued(result reconcile.Result) {


These are two functions that I think could be generally useful and I would consider putting in operatorpkg

Good call, I've opened a PR: awslabs/operatorpkg#130

k8s-ci-robot · 2025-02-21T01:54:59Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 11, 2024

k8s-ci-robot requested review from engedaam and tallaxes December 11, 2024 20:26

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 11, 2024

jmdeal force-pushed the feat/termination-conditions branch from 4bb4d97 to fb3ac47 Compare December 11, 2024 20:31

jmdeal mentioned this pull request Dec 11, 2024

[WIP] deps: bump sigs.k8s.io/karpenter aws/karpenter-provider-aws#7520

Closed

3 tasks

k8s-ci-robot assigned engedaam Dec 12, 2024

njtran reviewed Dec 18, 2024

View reviewed changes

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2025

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2025

engedaam reviewed Jan 14, 2025

View reviewed changes

jmdeal force-pushed the feat/termination-conditions branch 2 times, most recently from b527992 to 21176e1 Compare January 15, 2025 20:02

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025

jmdeal mentioned this pull request Jan 15, 2025

[WIP] deps: bump kubernetes-sigs/karpenter aws/karpenter-provider-aws#7599

Closed

3 tasks

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025

engedaam approved these changes Jan 15, 2025

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025

jmdeal force-pushed the feat/termination-conditions branch from d536a96 to 43949ef Compare January 16, 2025 17:21

engedaam reviewed Jan 16, 2025

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2025

jonathan-innis reviewed Jan 24, 2025

View reviewed changes

jmdeal force-pushed the feat/termination-conditions branch from 43949ef to 4ed938c Compare January 27, 2025 05:33

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025

jmdeal force-pushed the feat/termination-conditions branch from 4ed938c to 015140d Compare February 10, 2025 19:05

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 10, 2025

jmdeal force-pushed the feat/termination-conditions branch 2 times, most recently from f4ba729 to 233f43a Compare February 10, 2025 19:15

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2025

jmdeal added 3 commits February 10, 2025 17:39

feat: drain and volume detachment status conditions

fd75ab1

fix: initialize fake clock in terminator test

7901a70

review

1ddb121

jmdeal force-pushed the feat/termination-conditions branch from 233f43a to 1ddb121 Compare February 11, 2025 01:39

jonathan-innis reviewed Feb 11, 2025

View reviewed changes

jmdeal mentioned this pull request Feb 12, 2025

feat: add expectations for reconciler requeue awslabs/operatorpkg#130

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: drain and volume detachment status conditions #1876

feat: drain and volume detachment status conditions #1876

jmdeal commented Dec 11, 2024 •

edited

Loading

coveralls commented Dec 11, 2024 •

edited

Loading

engedaam commented Dec 12, 2024

github-actions bot commented Jan 2, 2025

jmdeal commented Jan 11, 2025

jmdeal commented Jan 15, 2025

engedaam commented Jan 15, 2025

engedaam left a comment

k8s-ci-robot commented Jan 15, 2025

engedaam left a comment

jonathan-innis Jan 24, 2025

jmdeal Jan 24, 2025

jonathan-innis Jan 24, 2025

jonathan-innis Feb 11, 2025

jonathan-innis Jan 24, 2025

jmdeal Feb 10, 2025

k8s-ci-robot commented Jan 27, 2025

jmdeal commented Feb 10, 2025

jonathan-innis left a comment

jonathan-innis Feb 11, 2025

jonathan-innis Feb 11, 2025

jmdeal Feb 12, 2025

jonathan-innis Feb 11, 2025

jonathan-innis Feb 11, 2025

jmdeal Feb 12, 2025

jonathan-innis Feb 11, 2025

jonathan-innis Feb 11, 2025

jmdeal Feb 12, 2025

k8s-ci-robot commented Feb 21, 2025

feat: drain and volume detachment status conditions #1876

Are you sure you want to change the base?

feat: drain and volume detachment status conditions #1876

Conversation

jmdeal commented Dec 11, 2024 • edited Loading

coveralls commented Dec 11, 2024 • edited Loading

Pull Request Test Coverage Report for Build 13248558350

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

engedaam commented Dec 12, 2024

github-actions bot commented Jan 2, 2025

jmdeal commented Jan 11, 2025

jmdeal commented Jan 15, 2025

engedaam commented Jan 15, 2025

engedaam left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Jan 15, 2025

engedaam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-ci-robot commented Jan 27, 2025

jmdeal commented Feb 10, 2025

jonathan-innis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-ci-robot commented Feb 21, 2025

jmdeal commented Dec 11, 2024 •

edited

Loading

coveralls commented Dec 11, 2024 •

edited

Loading